HP Confidential
Hewlett-Packard Company

Netware/iX Performance Patch
LAN Labs Benchmark Results
Version: July 24, 1992

Summary

This report summarizes the performance testing of the Netware/iX Performance Patch using the LAN Labs Benchmarks from PC Magazine. These results are correlated with previous testing on earlier versions of Netware/iX, on Netware/9000, and on Netware running on Intel platforms.

Results

Overall there have been major improvements in Netware/iX's file sharing performance, which met the goals set for this product. In the single user case (called "no load" in the accompanying tables and graphs) Netware/iX is able to deliver 81% of the performance of a Netware 3.11 server running on a Vectra 486/25. CPU utilization was at 5%.

The best results, however, are seen under load conditions. As benchmark clients were added to Netware/iX running on a 917, it degraded more slowly than any other platform. At five benchmark client loads it was the best Netware server tested. Specifically, it was marginally better than the 486 and 386 servers, 28% better than the HP9000 720, and 265% better than the HP9000 807 running Netware/9000.

Your Mileage May Vary

These results are reproducible on the test platform described later in this document; however, the reader should be aware that a specific user environment may see either better or worse results than those described here, depending on what the user is doing and on the environment of the system. The interpretation section below should help in understanding what conditions may lead to different results. The improvements in Netware/iX will only be available on an NIO system. The best performance will be seen on NOVA class SPUs.

Put All Your Eggs in One Basket, and Watch That Basket

The goal of the Netware/iX Performance Patch was to improve file server performance, specifically the read/write path, since the vast majority of user work is done there. The SPX IPC API, the printer spooler, and the directory functions were improved somewhat, but not to the degree of the read/write path improvements.

Measured Systems

The following tables and graphs contain measurements from previous tests as well as the results being reported here. The entry "917 FP" indicates Netware/iX with the "Fast Path" code turned on; this is the Netware/iX Performance Patch software. Previous tests using the "917 SP" and "967 SP" were done earlier this year and used the "Slow Path" code prior to the performance patch software. Those results are reported to show the improvement of the Fast Path code. For comparison, the benchmarks were run against a 386 and a 486 running Netware 3.11, the current product from Novell, as well as against the HP9000 807 (which uses the same processor board as the 917) and the 837 (which uses the Nova 2.3 processor board). The number of loads ranged from a single measurement system (the "no load" case) to the measurement system plus ten additional loads.

Throughput

Since throughput is usually the first thing we are asked about, here are the numbers. There are several things which can be seen here. First, as mentioned above, in the single user case (called "no load") the performance of Netware/iX in this benchmark is close to the 486, the 386, and the HP9000 720, while it is better than the HP9000 837, 817, and 807 (which uses the same processor board as the 917). The better news is that at five loads Netware/iX outperforms the 720 and marginally exceeds the 486.
In the accompanying tables, the systems tested are listed in order of best performance for the five load case. "FP" stands for fast path, or the Netware/iX Performance Patch, and "SP" stands for slow path, or the Netware/iX product prior to the performance enhancements.

LAN Labs PC Benchmark Tests
Throughput (KBytes/sec)

Loads      917 FP  486/25  386/25   720   837   817   807  917 SP
No Load       283     349     329   366   251   219   185     115
1             256     315     292   304   215   184   159      72
2             241     288     250   234   186   162   121      47
3             234     250     238   238   167   146    92      34
4             222     226     210   224   151   133    72      27
5             219     211     198   171   141   131    60      22
6             218       -       -   165   135   113     -       -
10            204       -       -   159   121     -     -       -

Italicized entries are extrapolations.

CPU Utilization

The observed CPU utilization is documented in the following table and graph. In the table the systems tested are listed in order of best performance for the five load case. The important item to note is that the 917 has lower CPU utilization than the 807, which is its peer with regard to processor board. The only metric where the 837 outperformed the 917 FP was CPU utilization, which is to be expected since the 837 uses the Nova 2.3 SPU as compared to the 1.0 SPU of the 917 and 807. The very low CPU utilization of the 486 under load indicates that, for the platform being tested, there is a bottleneck other than the CPU which is limiting server performance.

LAN Labs PC Benchmark Tests
CPU Utilization

Loads      486/25   837  386/25  917 FP   817   720   807  917 SP
No Load       28%   20%     50%      5%   33%   43%   50%     80%
1             29%   38%     56%     29%   55%   66%   75%    100%
2             30%   50%     62%     55%   78%   82%   98%    100%
3             31%   58%     68%     63%   85%   88%  100%    100%
4             32%   70%     73%     77%   92%   90%  100%    100%
5             33%   72%     80%     86%   95%   97%  100%    100%
6               -   72%       -     88%   99%   99%     -       -
10              -   77%       -    100%     -  100%     -       -

Italicized entries are extrapolations.

Server Degradation with Offered Load

The observed client throughput degradation is documented in the following table and graph. The systems tested are listed in order of best performance for the five load case. This measures the rate at which the server slows down due to increased load. The important item to note here is that the 917 FP degrades more slowly than all other platforms. This indicates that, until the SPU reaches near 100% CPU utilization, the 917 FP will continue to gain on and surpass the performance of the 486 which was tested. This is good news for the scalability of Netware/iX on the high end Nova boxes, where additional CPU is available to the product beyond the 917's capacity.

LAN Labs PC Benchmark Tests
Throughput Degradation as Compared to No Load

Loads      917 FP  486/25  386/25   817   837   720   807  917 SP
No Load      100%    100%    100%  100%  100%  100%  100%    100%
1             91%     90%     89%   84%   86%   83%   86%     63%
2             85%     82%     76%   74%   74%   64%   65%     41%
3             83%     72%     72%   67%   66%   65%   50%     30%
4             78%     65%     64%   61%   60%   61%   39%     24%
5             77%     61%     60%   60%   56%   47%   32%     20%
6             77%       -       -   52%   54%   45%     -       -
10            72%       -       -     -   48%   43%     -       -

Italicized entries are extrapolations.
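The degradation percentages in the table above are simply each platform's throughput at a given load expressed as a percentage of its own no-load throughput, and the comparisons quoted in the Results section (81% of the 486 at no load, 28% better than the 720 and 265% better than the 807 at five loads) are ratios taken from the throughput table. The short C program below is not part of the benchmark suite; it is only a convenience sketch that reproduces those figures from the published numbers.

    #include <stdio.h>

    /*
     * Cross-check of the published ratios.  The throughput figures are copied
     * from the throughput table above (KBytes/sec, no load through 5 loads);
     * only the platforms quoted in the Results section are included.  Output
     * agrees with the degradation table to within a point of rounding.
     */
    struct platform { const char *name; double tput[6]; };

    static const struct platform servers[] = {
        { "917 FP", { 283, 256, 241, 234, 222, 219 } },
        { "486/25", { 349, 315, 288, 250, 226, 211 } },
        { "720",    { 366, 304, 234, 238, 224, 171 } },
        { "807",    { 185, 159, 121,  92,  72,  60 } },
    };

    int main(void)
    {
        int i, load;

        /* Degradation: throughput at N loads as a percentage of no-load throughput. */
        for (i = 0; i < 4; i++) {
            printf("%-7s", servers[i].name);
            for (load = 0; load <= 5; load++)
                printf(" %4.0f%%", 100.0 * servers[i].tput[load] / servers[i].tput[0]);
            printf("\n");
        }

        /* Relative comparisons quoted in the Results section. */
        printf("No load, 917 FP vs 486/25:  %.0f%% of the 486\n", 100.0 * 283 / 349);
        printf("5 loads, 917 FP vs 720:    +%.0f%%\n", 100.0 * (219.0 / 171 - 1.0));
        printf("5 loads, 917 FP vs 807:    +%.0f%%\n", 100.0 * (219.0 / 60 - 1.0));
        return 0;
    }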
Improvement

The following graph shows that our improvement over the slow path product scales dramatically. There are two reasons for this. First, under the previous product we consumed the 917 CPU with a single load, whereas the fast path code has CPU to spare. Further, the NIO LAN card latency problem, which would prevent even an infinitely fast SPU from matching the single-user performance of the 486, is less apparent under load. Whereas the 486, with its fast LAN card, degrades with each new load, on MPE/iX the bottleneck is the LAN card and the CPU is able to keep up with the load in these tests.

Interpretation

Description of Netware/iX Fast Path Code

Briefly, the fast path code implements a scheme similar to native Netware itself to gain performance: large amounts of file data are cached in physical memory, eliminating the need to access the file system for most reads and writes. On MPE/iX this is implemented as an extension to the LAN driver through a modification of the LAN Access (LA) interface. File reads and writes occur against the cached file data in physical memory rather than against the file on disc. Because the fast path code exists as an extension to the LAN driver, access to the cached file data occurs on the Interrupt Control Stack (ICS), outside of the process environment of MPE, eliminating dispatcher calls and preemption by MPE. The net result is that access for reads and writes is very fast.

In addition to the file access improvements, Quest modified their code to eliminate some of the previously known bottlenecks. These improvements were primarily in directory access, caching disc requests, and pre-allocating file opens. No changes were made to the SPX Interprocess Communication (IPC) API or to the print spooling and sharing features.
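To make the fast path / slow path split concrete, here is a minimal sketch of the decision described above. It is purely illustrative: the structure and function names (connection_cache, cached_read, slow_path_read) are invented for this example and do not correspond to the actual LAN driver or LA interface code, and a user-level C program obviously cannot reproduce the ICS environment in which the real code runs.

    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    /* Illustrative only -- names and layout are invented for this sketch.
     * A per-connection window of file data pinned (frozen) in physical
     * memory; in the product the size of this window is the globally
     * configured per-client allocation. */
    struct connection_cache {
        long   file_offset;   /* file offset of the first cached byte */
        size_t length;        /* bytes of file data held in memory    */
        char  *data;          /* the frozen buffer                    */
    };

    /* Stand-in for the slow path: the normal route through the MPE file
     * system (process environment, dispatcher, disc I/O). */
    static size_t slow_path_read(long offset, void *buf, size_t len)
    {
        (void)buf;
        printf("  -> slow path (file system) for offset %ld\n", offset);
        return len;   /* pretend the file system satisfied the request */
    }

    /* If the request falls entirely inside the cached window, satisfy it
     * from memory; otherwise fall back to the slow path. */
    static size_t cached_read(struct connection_cache *cc,
                              long offset, void *buf, size_t len)
    {
        if (offset >= cc->file_offset &&
            offset + (long)len <= cc->file_offset + (long)cc->length) {
            memcpy(buf, cc->data + (offset - cc->file_offset), len);  /* fast path */
            return len;
        }
        return slow_path_read(offset, buf, len);                      /* slow path */
    }

    int main(void)
    {
        static char window[50 * 1024];            /* 50 KByte per-client window */
        struct connection_cache cc = { 0, sizeof window, window };
        char buf[512];

        /* A read inside the window is served from memory; a read beyond it
         * drops to the slow path. */
        printf("read @ 4096:   %lu bytes\n",
               (unsigned long)cached_read(&cc, 4096, buf, sizeof buf));
        printf("read @ 400000: %lu bytes\n",
               (unsigned long)cached_read(&cc, 400000, buf, sizeof buf));
        return 0;
    }

The point of the sketch is the branch: as long as a client's accesses stay within its frozen window, the server never enters the process environment; as soon as they fall outside it, the slow path and its much higher cost are taken. The sections below turn this behaviour into tuning advice.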
Description of test environment

The full environment for testing is described in Appendix A below. In brief, there is a single PC, in this case an IBM PS/2 Model 70, which measures its individual throughput while other PCs add load to the server. The throughput of the measurement PC is the number reported above. As more PCs add load to the server, the server attempts to meet the increased load requirements, reducing the service provided to the measurement PC. This is seen as degraded throughput performance at the measurement PC, as reported above.

Factors which may change the observed performance

Memory allocated per client connection. One key factor in gaining the maximum performance is the amount of memory dedicated to each client connection. The amount of physical memory allocated to each client is globally configurable (one setting applies to all clients). Reducing the frozen size below the working set size will cause slower performance, as the server must go into the process environment to update the working set. Two factors are at work: first, this slows the server down because the OS overhead is now much larger; second, the other processes on the system with the same or higher priority are now able to contend with the server. This may or may not be desirable, depending on how the server and the system are tuned and on the overall system goals.

Single user throughput will not meet the single user throughput of the Intel server. The latency of the NIO LAN card is greater than the entire LAN and processing time of the 486. This means that in the single user case, an NIO system will never meet the single user performance of the 486.

Conditions under which better performance will be seen

Strictly read/write. Since the Netware/iX fast path code deals only with the read/write path and files cached in memory, applications which open files and leave them open for a long time will have better performance than applications which open many small files and jump back and forth. Windows is an example of a program which opens many small files.

Working set of the cached file matches the amount of memory allocated by Netware/iX. Since each client connection is allocated a block of physical memory for file caching, better performance will be seen when the size of this block of memory matches the working set of the files opened. For example, if the file being accessed is 500 KBytes but only 50 KBytes are consistently accessed, then allocating the smaller amount will provide a good match between the client requirements and the server resources. However, if the entire file is being accessed, such as in the case of random access of a database file, then the larger amount would better suit the client access. One of the implications of a small memory allocation to the clients is heavier use of the "slow path" code to access the actual data file and bring it into memory. Two possible side effects would be increased CPU utilization and decreased throughput. (A rough arithmetic sketch of this effect appears after these conditions.)

Large reads (e.g. loading data, program load). Large data reads, such as during a program load or when reading a large spreadsheet into a program, will provide better performance than many small reads.

Large writes (e.g. writing a data file). Large data writes, such as writing a document out of a word processor, will provide better performance than many small writes to several files.

Conditions which will degrade performance results

Random access of a file beyond the amount of memory frozen for the client. Where file access requires a "search", or where file access occurs outside of the data cached in memory, the "slow path" code is invoked, resulting in slower performance and increased CPU utilization.

Directory operations (e.g. new directory, new file, moving files, copying directory structures). All directory operations use the "slow path" code, which performs actual disc I/O and is not cached. Heavy usage of directory calls, such as copying an entire directory structure, will result in decreased performance.

API usage. No optimization of the SPX or NetBIOS APIs was done in this project. The APIs use the previous version of the code and will result in overall slower performance due to the increased load that they put on the system.

Print spooling. No optimization of printer spooling was done in this project. Heavy use of the print spooler will result in overall slower performance due to the increased load that it puts on the system.
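The 500 KByte / 50 KByte example above can be turned into rough arithmetic. The sketch below estimates the effective transfer rate when random access spills beyond the per-client allocation. It is only an illustration: it assumes uniformly random accesses, and it borrows the measured no-load throughputs of the fast path and slow path products (283 and 115 KBytes/sec) purely as stand-ins for the fast and slow per-request service rates.

    #include <stdio.h>

    /*
     * Rough illustration of the working-set effect described above.  Assumes
     * uniformly random access over the file's working set, and borrows the
     * measured no-load throughputs (283 KB/s fast path, 115 KB/s slow path)
     * purely as stand-ins for the fast and slow per-request service rates.
     */
    static double effective_kbps(double alloc_kb, double working_set_kb,
                                 double fast_kbps, double slow_kbps)
    {
        /* Fraction of accesses that land inside the frozen window. */
        double hit = (working_set_kb <= alloc_kb) ? 1.0 : alloc_kb / working_set_kb;

        /* Time-weighted (harmonic) combination of fast and slow service. */
        return 1.0 / (hit / fast_kbps + (1.0 - hit) / slow_kbps);
    }

    int main(void)
    {
        /* Working set fits the 50 KB allocation: everything stays on the fast path. */
        printf("50 KB alloc,  50 KB working set: %.0f KB/s\n",
               effective_kbps(50, 50, 283, 115));

        /* 500 KB accessed at random through a 50 KB window: about 90% of the
         * requests take the slow path, so throughput collapses toward it. */
        printf("50 KB alloc, 500 KB working set: %.0f KB/s\n",
               effective_kbps(50, 500, 283, 115));
        return 0;
    }

The exact numbers are not meaningful; the point is that once random access spills well beyond the frozen window, throughput collapses toward the slow path figure, which is why matching the per-client allocation to the real working set matters.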
Extrapolations

Scalability of Higher Capacity SPUs. Given that the throughput degradation on the 917 was directly related to CPU consumption, we anticipate that Netware/iX will be highly scalable to higher capacity SPUs. The effect the user will see, however, is not increased speed but increased capacity. With the 967 SPU (the 2.3 Nova SPU), we would expect to see commensurately lower CPU utilization and less degradation in the benchmark environment. This would result in increased capacity. Again, the NIO LAN card latency would prevent improved single user throughput from being seen on the higher capacity SPUs.

Appendix A: Description of Test Platforms

Load Measuring Client
  IBM PS/2 Model 70
  8 Mbytes RAM
  3COM 3C523 LAN card

Vectra 486 Server, 25 MHz [1]
  Netware 386 version 3.11
  12 Mbytes of RAM
  ISA ESDI disc controller [2]
  1 EISA LAN card [3]

HP 3000 917 Server (This server used the Nova 1.0 board, although the chassis and memory configuration may have been closer to the standard 947 configuration.)
  Netware/XL A.01.01, based on Portable Netware 3.01b, for Slow Path results
  Netware/XL A.01.08 002, based on Portable Netware 3.01b and MPE/iX 3.1, for Fast Path results
  40 Mbytes of RAM
  32 Kbytes I cache
  64 Kbytes D cache
  1 NIO LAN card

HP 9000 807 Server
  Netware/9000, based on Portable Netware 3.01b
  48 Mbytes of RAM

HP 9000 720 Server
  Netware/9000, based on Portable Netware 3.01b
  48 Mbytes of RAM

Load Producing Clients
  1st load: Compaq 386/20 or Vectra 386/20N
  2nd load: Compaq 386/25 or Vectra RS/25C
  3rd load: Vectra QS/20
  4th load: Vectra QS/20
  5th load: Vectra ES/12
  6th load: Vectra Classic
  Loads 7 through 10: whatever PCs we can find, including a Compaq, a Vectra 486, and Vectra Classics

Appendix B: Description of Benchmark Tests

Quoting from the benchmark documentation: "The PC Magazine LAN Benchmark tests exercise and evaluate networks using tasks identical to those performed by real applications. Please note that the results you get with these and all other performance tests are only comparable to other tests run under identical conditions. The CPU power of the PCs [both servers and clients] running this software is an often-overlooked factor that is critical to the results you receive. When you run these tests, one network client PC runs the Evaluation Module of this program while all other PCs run a selection from the Load Module. This architecture provides a measurement of what one user would experience on a crowded network. Remember that just two or three fast PCs running Load programs can use up a large percentage of the available bandwidth in any 10-16 megabit per second network."

From other sources, we've determined that one load client is roughly equivalent to 10 to 20 "real" clients on the network. The load clients are running far faster than any actual user would be able to work. The best way to look at these numbers is to compare similar test conditions across platforms. However, before making direct comparisons with other PC Magazine results, you should remember that the mix of slow and fast clients will affect the results seen at the server. In order to gain 100% accuracy, the exact test setup of the article should be duplicated. For example, a slow client will make a slow server look better, since throughput is measured at the client and a slow client would be unable to take advantage of the full potential of a fast server. It is our opinion that these results are sufficient for internal use and for comparisons between HP's products and the numbers quoted in PC Magazine. However, more accurate testing would be required if we were going to publish these results.

Bibliography: Other Sources for NOS Performance Information

DEC VAX as AppleShare Server - MacUser (an article that appeared sometime within the last six months). The VAX was far slower than the Macintosh II used as a server.

Other Portable Netware Implementations - PC Week; August 19, 1991. The article compared Portable Netware running on NCR, Prime, and Altos Unix systems. The general result was that the Compaq 486/33 MHz was running about 5 times faster than the Unix systems. Note that these tests were not the PC Magazine tests described here.

LAN Manager 2.0 versus Netware 386 3.1 - PC Magazine; December 11, 1990. This article compared the latest versions (as of that writing) of Netware and LAN Manager. The results show the Netware platform (Compaq dual 386/33) under no load performing at about 281 KBytes/sec, and LM 2.0 at 200 KBytes/sec.
Under 5 loads, Netware performed at 270 KBytes/sec and LM 2.0 at 183 KBytes/sec. The differences between those numbers and our test results lie in the servers used (Compaq dual 386 versus Vectra 486), the version of Netware (3.1 versus 3.11, which did improve performance), and the difference in clients.

Netware 386 as a Macintosh AppleShare Server - MacUser; November 1991. Netware 386 running on a Compaq 486/33 was 6 times faster than a Macintosh IIfx (the current high end Macintosh as of the writing of the article) in the same role.

Intel-based Superservers - Data Communications; "Can Superservers Scale Up to Enterprise Status?", July 1991. The article compares the then-current list of "superservers" and asks whether they can scale up to handle the data loads then being handled by mini- and mainframe computers.

Notes

[1] Novell recommends a 33 MHz 486 as a departmental LAN server.
[2] This is considered to be a relatively slow disk. If further testing is requested, we should move to a SCSI-based system. Novell recently reported on a Vectra 486/33 MHz with 32-bit SCSI controllers at the Netware for Unix conference in January.
[3] This is considered the "fast" LAN card.